Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-33958: secret re-creation scenario for externalCertificate with active informer #614

Merged
merged 10 commits into from
Jan 6, 2025

Conversation

chiragkyal
Copy link
Member

@chiragkyal chiragkyal commented Jul 15, 2024

This PR:

  • Handles secret re-creation scenarios for externalCertificate where a previously deleted secret is added again.

  • Introduces a deletedSecrets map in RouteSecretManager to track routes for which the associated secret was deleted. This helps differentiate between new secret creation and recreation of a previously deleted secret.

  • Refactors secret handling logic in generateSecretHandler to address these changes.

    • AddFunc handles the secret recreations. Upon detecting a recreated secret, it revalidates the route, updates its TLS configuration, and triggers a Modified event for the next plugins.
    • DeleteFunc does not call UnregisterRoute(), instead tracks deleted secrets.
  • Improves the logic for handling Modified events for routes with externalCertificate to avoid blindly unregistering and re-registrations with the secret manager.

  • Makes stopping of the informer self-contained to the RouteSecretManager plugin.

    • RemoveRoute (templateRouter) should not call UnregisterRoute() to avoid removal of informer when RouteSecretManager calls the next plugin chain with watch.Deleted event.
    • Informer will be stopped only during the following cases
  • Adds unit tests to cover the new logic and scenarios.

    New Helper Functions:

  • validate() to encapsulate the externalCertificate validation logic.

  • populateRouteTLSFromSecret() to update a route's in-memory TLS configuration from the referenced secret.

  • unregister() to remove the route's registration with the secret manager and clean up references in deletedSecrets.

Part of : https://issues.redhat.com//browse/OCPBUGS-33958

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2024
@openshift-ci-robot openshift-ci-robot added the jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. label Jul 15, 2024
Copy link
Contributor

openshift-ci bot commented Jul 15, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 15, 2024
@openshift-ci-robot
Copy link
Contributor

@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 15, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2024
@chiragkyal chiragkyal force-pushed the active-informer branch 6 times, most recently from 41a1750 to f51a3f2 Compare July 23, 2024 07:48
@chiragkyal
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 25, 2024
@openshift-ci-robot
Copy link
Contributor

@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from lihongan July 25, 2024 09:09
@chiragkyal chiragkyal force-pushed the active-informer branch 4 times, most recently from 70121a3 to f655bd2 Compare August 28, 2024 07:48
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2024
@chiragkyal chiragkyal marked this pull request as ready for review August 28, 2024 07:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 28, 2024
@openshift-ci openshift-ci bot requested review from frobware and Miciah August 28, 2024 07:50
@openshift-ci-robot openshift-ci-robot removed the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 28, 2024
@openshift-ci-robot
Copy link
Contributor

@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.17.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

This PR:

  • Handles secret re-creation scenarios for externalCertificate where a previously deleted secret is added again.

  • Introduces a deletedSecrets map in RouteSecretManager to track routes for which the associated secret was deleted. This helps differentiate between new secret creation and recreation of a previously deleted secret.

  • Refactors secret handling logic in generateSecretHandler to address these changes.

  • AddFunc handles the secret recreations. Upon detecting a recreated secret, it revalidates the route, updates its TLS configuration, and triggers a Modified event for the next plugins.

  • DeleteFunc does not call UnregisterRoute(), instead tracks deleted secrets.

  • Improves the logic for handling Modified events for routes with externalCertificate to avoid blindly unregistering and re-registrations with the secret manager.

  • This change is also required to avoid always calling UnregisterRoute(), which leads to the stopping of the informer.

  • Depends on OCPBUGS-33958: retrieve registered secret name from secretManager library-go#1740

  • Makes stopping of the informer self-contained to the RouteSecretManager plugin.

    • RemoveRoute (templateRouter) should not call UnregisterRoute() to avoid removal of informer when RouteSecretManager calls the next plugin chain with watch.Deleted event.
    • Informer will be stopped only during the following cases
  • Adds unit tests to cover the new logic and scenarios.

    New Helper Functions:

  • validate() to encapsulate the externalCertificate validation logic.

  • populateRouteTLSFromSecret() to update a route's in-memory TLS configuration from the referenced secret.

  • unregister() to remove the route's registration with the secret manager and clean up references in deletedSecrets.

Part of : https://issues.redhat.com//browse/OCPBUGS-33958

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chiragkyal
Copy link
Member Author

Thanks for doing the test. I also tried on a live cluster.

I think there is a race condition happening while updating the status of the route. The secret handler status update is racing with the status update of the plugin re-trigger.

The top level plugin has a lock :

func (c *RouterController) HandleRoute(eventType watch.EventType, obj interface{}) {
route := obj.(*routev1.Route)
c.lock.Lock()
defer c.lock.Unlock()
c.processRoute(eventType, route)
c.Commit()
}

which means all routes are getting processed serially, unlike secret handler.

As a solution, I think the StatusAdmitter plugin can have its own Mutex to lock the methods like RecordRouteUpdate, RecordRouteRejection, HandleRoute, etc., so that we can avoid race conditions.

@chiragkyal
Copy link
Member Author

@alebedev87 @Miciah As per our discussion over sync call, I've updated the contention tracker to ignore the status update done by the secret manager. Please take a final look at the PR, and let's try to get this merged. Thanks!

@chiragkyal
Copy link
Member Author

I did some testing on a real cluster with multiple routes using the same secret, and after the latest changes, the transition of the status were as expected and consistent through out the routes.

$ oc get route -n sandbox -w
NAME        HOST/PORT                                                                 PATH   SERVICES          PORT    TERMINATION   WILDCARD
example     example-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com            hello-openshift   <all>   edge          None
example1    example1-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example10   example10-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com          hello-openshift   <all>   edge          None
example11   example11-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com          hello-openshift   <all>   edge          None
example2    example2-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example3    example3-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example4    example4-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example5    example5-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example6    example6-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example7    example7-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example8    example8-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None


// Secret Deleted


example5    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example2    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example6    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example7    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example3    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example8    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example     ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example11   ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example1    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example10   ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example4    ExternalCertificateSecretDeleted                                                 hello-openshift   <all>   edge          None
example5    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example2    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example6    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example7    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example3    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example8    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example     ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example11   ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example1    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example10   ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None
example4    ExternalCertificateValidationFailed                                              hello-openshift   <all>   edge          None

// Secret Re-Created

example5    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example10   ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example11   ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example     ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example4    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example1    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example7    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example2    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example8    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example6    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example3    ExternalCertificateSecretRecreated                                               hello-openshift   <all>   edge          None
example5    example5-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example10   example10-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com          hello-openshift   <all>   edge          None
example11   example11-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com          hello-openshift   <all>   edge          None
example     example-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com            hello-openshift   <all>   edge          None
example4    example4-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example1    example1-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example7    example7-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example2    example2-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example8    example8-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example6    example6-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None
example3    example3-sandbox.apps.ckyal-20241216-486d8f62.devcluster.openshift.com           hello-openshift   <all>   edge          None

@alebedev87
Copy link
Contributor

/approve

Thanks for the testing, we should be good now.

Copy link
Contributor

openshift-ci bot commented Dec 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 19, 2024
@chiragkyal
Copy link
Member Author

tide/merge-method-squash

@chiragkyal
Copy link
Member Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Dec 19, 2024
Comment on lines +297 to +299
if ignoreIngressConditionReason.Has(a.Reason) || ignoreIngressConditionReason.Has(b.Reason) {
return true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed today that this condition may be broader than necessary, but we couldn't think of a realistic scenario in which it would cause problems. However, we may need to revisit this condition if we ever observe any strange behavior with respect to status updates for routes with external certificates during, say, a rolling update of a router deployment.

Comment on lines +93 to +94
// - If the external certificate has changed, it unregisters the old one and registers the new one.
// - If the external certificate has not changed, it revalidates and updates the in-memory TLS certificate and key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is a little unclear. Presumably by "the external certificate has changed" or "the external certificate has not changed", you mean that the external certificate secret reference has or has not changed. That is, if the secret reference changes, you need to track the new referent, but if the reference is the same, you only need to check whether the referent secret's contents have changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean that the external certificate secret reference has or has not changed

Yes exactly, if the secret reference is changed to a new secret, we will track the new secret. And if the secret name is unchanged, we will check the contents of the secret.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming my understanding! If the comment is unclear, we should try to make it clearer. If not in this PR, maybe in the next one that touches this code.

Comment on lines +237 to +240
// - Handles secret recreation: If a secret is recreated (created after being
// deleted), it fetches the latest route object associated with it using the
// RouteLister and updates the route's status to trigger a full
// re-evaluation of the route by the entire plugin chain.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a gap here. Consider the following scenario:

  1. Start with a router deployment.
  2. Create a route with an external certificate reference to secret.
  3. Delete the secret.
  4. Restart the router deployment's pods (for example, for a rolling update).
  5. Recreate the secret.

The new router pods will reject the route at step 4 and will miss the secret recreate event at step 5, right?

It might not be feasible to close this gap, but I want to raise this concern to make sure I understand correctly and to note that it is a potential issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is likely possible because the route pod is managing the lifecycle of the secret manager and the secret informers. Restarting the router deployment's pods will kill the previously stored watch, and new pods will not register the new watch because the secret won't be there. So, yeah, in this scenario, we will miss the secret recreate event.

It might not be feasible to close this gap, but I want to raise this concern to make sure I understand correctly and to note that it is a potential issue.

I agree on this, given the secret manager's dependency on the router pods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see something related to this in the EP as well.

The router will bootstrap the secret monitor to ensure it can keep the certificate up-to-date. This means the router pod will maintain active watches for every secret that is referenced by a route.

xRef: https://github.com/openshift/enhancements/blob/master/enhancements/ingress/route-secret-injection-for-external-certificate-management.md#implementation-detailsnotesconstraints-optional

Copy link
Contributor

openshift-ci bot commented Dec 19, 2024

@chiragkyal: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn 62daef4 link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Miciah
Copy link
Contributor

Miciah commented Jan 2, 2025

Just an observation: It occurs to me that using --update-status=false or ROUTER_UPDATE_STATUS=false would prevent the router controller from updating the route status and thus prevent the secret manager from triggering reconciliation. However, we don't expose an API for --update-status=false or ROUTER_UPDATE_STATUS=false. We do use --update-status=false in some openshift/origin tests, but it would only become an issue if we had a test that used both external certificates and --update-status=false.

@@ -36,6 +37,7 @@ const (
// RouteStatusRecorder is an object capable of recording why a route status condition changed.
type RouteStatusRecorder interface {
RecordRouteRejection(route *routev1.Route, reason, message string)
RecordRouteUpdate(route *routev1.Route, reason, message string)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a future clean-up, HandleRoute could use this new method.

@Miciah
Copy link
Contributor

Miciah commented Jan 2, 2025

Excellent work! Thank you for your patience and effort with this PR!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 2, 2025
@chiragkyal
Copy link
Member Author

Excellent work! Thank you for your patience and effort with this PR!

Finally!! Thank you very much for the approval.

@openshift-merge-bot openshift-merge-bot bot merged commit 7a688b0 into openshift:master Jan 6, 2025
8 of 9 checks passed
@openshift-ci-robot
Copy link
Contributor

@chiragkyal: Jira Issue OCPBUGS-33958: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33958 has been moved to the MODIFIED state.

In response to this:

This PR:

  • Handles secret re-creation scenarios for externalCertificate where a previously deleted secret is added again.

  • Introduces a deletedSecrets map in RouteSecretManager to track routes for which the associated secret was deleted. This helps differentiate between new secret creation and recreation of a previously deleted secret.

  • Refactors secret handling logic in generateSecretHandler to address these changes.

  • AddFunc handles the secret recreations. Upon detecting a recreated secret, it revalidates the route, updates its TLS configuration, and triggers a Modified event for the next plugins.

  • DeleteFunc does not call UnregisterRoute(), instead tracks deleted secrets.

  • Improves the logic for handling Modified events for routes with externalCertificate to avoid blindly unregistering and re-registrations with the secret manager.

  • This change is also required to avoid always calling UnregisterRoute(), which leads to the stopping of the informer.

  • Depends on OCPBUGS-33958: retrieve registered secret name from secretManager library-go#1740

  • Makes stopping of the informer self-contained to the RouteSecretManager plugin.

    • RemoveRoute (templateRouter) should not call UnregisterRoute() to avoid removal of informer when RouteSecretManager calls the next plugin chain with watch.Deleted event.
    • Informer will be stopped only during the following cases
  • Adds unit tests to cover the new logic and scenarios.

    New Helper Functions:

  • validate() to encapsulate the externalCertificate validation logic.

  • populateRouteTLSFromSecret() to update a route's in-memory TLS configuration from the referenced secret.

  • unregister() to remove the route's registration with the secret manager and clean up references in deletedSecrets.

Part of : https://issues.redhat.com//browse/OCPBUGS-33958

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@chiragkyal chiragkyal deleted the active-informer branch January 7, 2025 05:32
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-haproxy-router-base
This PR has been included in build ose-haproxy-router-base-container-v4.19.0-202501070436.p0.g7a688b0.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-haproxy-router
This PR has been included in build openshift-enterprise-haproxy-router-container-v4.19.0-202501070436.p0.g7a688b0.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants